IEEE Transactions on Computational Biology and Bioinformatics

Institute of Electrical and Electronics Engineers (IEEE)

Preprints posted in the last 30 days, ranked by how well they match IEEE Transactions on Computational Biology and Bioinformatics's content profile, based on 17 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
MOSAIC: Model-based, Subgroup-Aware Identification of Driver Mutations in Cancer

Campbell, K.; Reyna, M. A.

2026-05-03 bioinformatics 10.64898/2026.04.29.721672 medRxiv
Top 0.1%
3.6%

In cancer genomics, recurrent patterns of mutual exclusivity within a gene set can indicate shared biological context and involvement in tumorigenesis. However, existing methods are not designed to distinguish mutual exclusivity arising from meaningful biological interactions from mutual exclusivity influenced by heterogeneity between underlying patient subpopulations. In this work, we introduce MOSAIC, a novel statistical framework that models patient subgroup heterogeneity in mutual exclusivity analyses. In experiments with simulated data and real data from The Cancer Genome Atlas, we show that MOSAIC amplifies subgroup-specific mutual exclusivity signals, including between IDH1 and IDH2 in young low-grade glioma patients, while reducing the effect of signals produced by underlying subgroup structures, such as distinct genomic lineages associated with histological subtypes of endometrial cancer. Finally, we demonstrate that MOSAIC is more powerful than existing p-value combination methods for patient subgroup stratification. MOSAIC is available as an open-source tool at https://github.com/reynalab/mosaic.

2
DistPCA: Tera-Scale Genomic PCA via Out-of-Core Distributed Parallelism

Mermigkis, G.; Sofotasios, A.; Kontopoulou, E.-M.; Gallopoulos, E.; Hadjidoukas, P.

2026-05-19 bioinformatics 10.64898/2026.05.15.725487 medRxiv
Top 0.1%
2.1%

Principal Component Analysis (PCA) is a fundamental tool in human genetics, widely used to study population structure. However, the rapid growth of modern genomic datasets, which often exceed main memory capacity, renders traditional PCA methods infeasible, motivating out-of-core approaches. Prior work on out-of-core genomic PCA has focused primarily on optimizing the inherently compute-intensive numerical core, largely overlooking the stages of data I/O and preprocessing, which emerge as significant performance bottlenecks at tera-scale. Furthermore, existing approaches remain limited to shared-memory single-node architectures, lacking support for distributed multi-node environments. To address these limitations, we introduce DistPCA, the first distributed out-of-core framework for tera-scale genomic PCA, implemented as a C++ library and scalable across both single- and multi-node systems. Built on top of the Message Passing Interface (MPI), the proposed framework employs multi-level data parallelism across the entire PCA pipeline, combining multiprocessing, multithreading, SIMD vectorization, and compute-transfer overlap, while remaining compatible with block-based methods that rely on associative operations. Extensive evaluation on real and synthetic datasets demonstrates near-linear scalability, achieving speedups of up to 58.2x and over 98% reduction in wall-clock time, while maintaining parallel efficiency above 82% and preserving accuracy in the recovered principal components.
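The abstract's reference to block-based methods built on associative operations can be illustrated with a small sketch. This is not DistPCA's code or API: the function names `block_stats`, `merge`, and `covariance` are hypothetical, and the example assumes the common approach of accumulating per-block sums and cross-products that are later merged into a covariance matrix. Because the merge step is associative, partial results from blocks read off disk (or from different processes) can be combined in any order.

```python
def block_stats(block):
    """Partial statistics for one block of rows: (row count, column sums, sum of outer products)."""
    d = len(block[0])
    sums = [0.0] * d
    gram = [[0.0] * d for _ in range(d)]
    for row in block:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                gram[i][j] += row[i] * row[j]
    return len(block), sums, gram


def merge(a, b):
    """Associative merge of two partial statistics (order of merging does not matter)."""
    n1, s1, g1 = a
    n2, s2, g2 = b
    d = len(s1)
    sums = [s1[i] + s2[i] for i in range(d)]
    gram = [[g1[i][j] + g2[i][j] for j in range(d)] for i in range(d)]
    return n1 + n2, sums, gram


def covariance(stats):
    """Population covariance matrix recovered from merged statistics."""
    n, sums, gram = stats
    d = len(sums)
    mean = [s / n for s in sums]
    return [[gram[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]


# Toy data: four samples, two features, processed as two blocks of two rows.
data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0], [7.0, 14.0]]
blockwise = covariance(merge(block_stats(data[:2]), block_stats(data[2:])))
full = covariance(block_stats(data))
```

In this sketch only one block needs to be in memory at a time, and the principal components would then be obtained from an eigendecomposition of the merged covariance matrix, a step omitted here.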

3
Stereochemistry-Aware Drug-Target Affinity Prediction

Ferreyra, S.; Dutra, I.; Galeano, A.; Paccanaro, A.

2026-05-18 bioinformatics 10.64898/2026.05.14.725200 medRxiv
Top 0.2%
1.7%

Drug-target affinity (DTA) prediction is a key task in drug discovery, enabling the estimation of the interaction strength between candidate compounds and biological targets. However, current models rely on connectivity-based molecular representations and do not explicitly account for the spatial organization of atoms, also known as stereochemistry. This limitation becomes evident when considering chirality, where a drug can exist as enantiomers, i.e., molecules that share the same atoms and bonds but differ in their three-dimensional arrangement. Despite their chemical similarity, enantiomers can interact differently with the same target, leading to variations in binding affinity and biological activity. In this paper, we propose a stereochemistry-aware DTA prediction framework that incorporates this information into molecular representations. Drug representations are learned from chemical structure using a directed-bond message passing graph neural network that captures enantiomer configurations, while protein targets are represented through sequence-based embeddings. Experiments on the Davis dataset demonstrate that our model can improve affinity prediction. Importantly, a case study on a manually curated dataset of enantiomers with different biological action shows that the model is able to distinguish the affinities of the two forms, consistent with their experimentally observed biological activity. These findings support the relevance of stereochemistry-aware molecular representation for more accurate and chemically faithful DTA prediction.

4
Counterfactual Explanations for Graph Neural Networks in Patient Outcome Prediction

Chaidos, N.; Dimitriou, A.; Calzi, H.; Casiraghi, E.; Stamou, G.; Valentini, G.

2026-05-20 bioinformatics 10.64898/2026.05.18.725906 medRxiv
Top 0.2%
1.7%

Counterfactual Explanation (CE) algorithms have been successfully applied to uncover the main factors driving computational diagnostic and prognostic predictions on tabular medical data. Recently, a new Network Medicine paradigm has been introduced for patient diagnosis and prognosis using Patient Similarity Networks (PSNs), i.e. graphs where patients are represented as nodes and their clinical and biomolecular similarities as edges. In this context, graph-based algorithms, including Graph Neural Networks (GNNs), can provide predictions using not only individual patient features but also their relations within a network of clinically and biomolecularly similar individuals. In this work, we propose the first CE algorithm tailored to explain diagnostic and prognostic predictions within PSNs. Alongside a contrastive GNN backbone, we introduce a versatile, model-agnostic counterfactual search method compatible with any underlying classifier. Preliminary results on synthetic data and on a cohort of patients affected by Alzheimer's disease show that our algorithm is competitive both with seminal tabular-based CE algorithms and with GNNExplainer, a well-established method for explaining graph-based classification tasks.

5
TopoFuseNet: Hierarchical Graph Representation Learning with Multi-Scale Topological Features for Accurate Drug Synergy Prediction

Wang, Q.; Shi, X.

2026-05-08 bioinformatics 10.64898/2026.05.05.722940 medRxiv
Top 0.2%
1.7%

Accurate prediction of drug synergy is paramount for developing effective combination therapies and advancing personalized medicine. Although methods based on graph neural networks (GNNs) have become a prevalent approach, they often treat molecules as flat graphs of connected atoms, thus overlooking their inherent hierarchical structure (i.e., atoms forming functional groups) and the critical topological information that governs molecular interactions. To address this limitation, we introduce TopoFuseNet, a novel hierarchical graph representation learning framework that integrates multi-scale topological features. The core innovations of TopoFuseNet include: 1) The first-ever application of "Group Centrality" from network science to cheminformatics, enabling the identification and quantification of functional groups crucial to drug activity; 2) A systematic, multi-path strategy to seamlessly integrate node-level (atom) and group-level (functional group) topological features into a Graph Attention Network (GAT) via feature augmentation, attention biasing, and hierarchical pooling; 3) A Differential Transformer module to deeply fuse multi-modal features learned from sequences, fingerprints, and our proposed hierarchical graph representations. Extensive experiments on two large-scale benchmark datasets, DrugComb and DrugCombDB, demonstrate that TopoFuseNet significantly outperforms state-of-the-art methods across multiple key metrics, including AUC, AUPRC, and F1-score, while exhibiting exceptional generalization robustness under various stringent cold-start scenarios. In-depth ablation studies further confirm the effectiveness and necessity of each proposed innovative module. Furthermore, multi-scale interpretability analysis and zero-shot cross-domain transfer experiments reveal that the model successfully captures molecular interaction rules with clear pharmacological significance, demonstrating immense practical potential for discovering novel combination therapies through large-scale virtual screening. Our work not only delivers a superior model for drug synergy prediction, but more importantly, it establishes a novel and scalable paradigm for effectively integrating hierarchical molecular structures and topological information into GNNs.

6
Benchmarking long-context genome language models on biosynthetic gene clusters

Hirota, K.; Higashi, K.; Kurokawa, K.; Yamada, T.

2026-05-15 bioinformatics 10.64898/2026.05.12.724296 medRxiv
Top 0.2%
1.6%

Recent advances in language models for natural language processing have spread to the field of genomics, driving the development of genome language models (gLMs) to decipher genomic information. Cutting-edge long-context gLMs are promising approaches for understanding and designing biological complexity, but their evaluation remains underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark focused on biosynthetic gene clusters for assessing long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification and coding sequence annotation. Using BGCs-Bench, we perform systematic and layer-wise evaluations of the embedding representations of long-context gLMs, demonstrating that layer selection is crucial for downstream task performance. In addition to the evaluation results, the logit lens analysis of autoregressive gLMs suggests that StripedHyena-based models use earlier layers to encode biologically meaningful information from input DNA sequences and deeper layers to optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.

7
A lightweight codon-based DNA Transformer for Regulatory Region Identification in the Genome

Karthik, A. S. P.; Das, A. B.

2026-05-07 bioinformatics 10.64898/2026.05.04.722647 medRxiv
Top 0.3%
1.5%

We developed a lightweight codon-based DNA Transformer equipped with multi-head self-attention and an adaptive classifier head, which achieves high accuracy in exon-intron classification and moderate accuracy in CDS classification and splice site recognition. We named this model ExIT (Exon-Intron Transformer). The model uses codon tokenization. It has been validated on the human genome, with external validation on the chimpanzee genome. Further benchmarking suggests that our model outperforms existing models on these tasks.

8
Guidance for high-quality functional gene embeddings from large language models

Huang, R.; Hou, Y.; Zhao, W.; Zhang, J.; Lu, J.; Kong, Y.; Xu, P.

2026-05-04 bioinformatics 10.64898/2026.04.30.721875 medRxiv
Top 0.3%
1.3%

Large language models (LLMs) are increasingly used to generate gene embeddings, yet systematic benchmarks of prompting strategies and practical guidance for obtaining biologically meaningful representations remain limited. Here we present GEbench, an evaluation framework for assessing LLM-derived gene embeddings across different tasks, prompting strategies, and LLM architectures. GEbench revealed that embedding quality depends primarily on whether the input text contains explicit functional information, rather than on sparse gene identifiers or model size. Identifier-based embeddings showed weak biological organization, whereas embeddings derived from functional descriptions consistently achieved stronger functional separation and predictive performance. Notably, Self-Des, which extracts embeddings from model-generated gene function descriptions, enabled locally deployable LLMs to generate high-fidelity representations that approach the quality of expert-curated databases. Genome-scale analyses further supported these findings, indicating that explicit functional descriptions are an effective design principle for generating high-quality gene embeddings from LLMs.

9
Cell-Level Virtual Screening

Ellington, C. N.; Addagudi, S.; Wang, J.; Lengerich, B. J.; Xing, E. P.

2026-05-13 bioinformatics 10.64898/2026.05.11.724149 medRxiv
Top 0.3%
1.3%

Virtual screening methods prioritize therapeutic candidates by predicting molecular properties and interactions. However, molecular models are insufficient to predict higher-order effects that arise in real biological systems, leading to late-stage failures in drug discovery. Virtual cells have been posed as a solution to this problem by predicting gene expression responses to drugs, but they remain weakly validated as screening tools; gene expression is only an intermediate in understanding drug success or failure. Despite burgeoning progress in virtual cells, some basic questions remain. Is expression even a good representation of higher-order drug effects? How can expression and other cell-level representations be applied to prioritize therapeutic candidates? Can cell-level methods be fairly compared against traditional molecular-level screens? We address these questions in a two-pronged approach. First, we curate two benchmarks, Drug-Disease Retrieval Bench (DDR-Bench) and Drug-Target Retrieval Bench (DTR-Bench), which directly compare cell-level methods against traditional molecular methods on canonical drug discovery tasks. DDR-Bench evaluates a method's ability to prioritize disease indications for drugs with novel target profiles. DTR-Bench evaluates a method's ability to reconstruct drug-target interactions from separate perturbation modalities that act on shared mechanisms, bridging the gap between cell-level methods and classic molecular screens. We identify shortcomings of existing screening methods on these benchmarks, and propose an alternative representation of drug effects: perturbed gene networks. Inferring post-perturbation gene networks on-demand for unseen drugs requires methods that generalize beyond traditional plug-in network estimators. We develop a scalable differentiable surrogate loss for multivariate Gaussians, which we apply to train a context-adaptive amortized estimator that maps perturbation metadata to gene-gene dependency network parameters. The resulting model, CellVS-Net, achieves SOTA on predicting how gene networks restructure under a variety of complex multivariate experimental conditions, including different cell types, small molecule therapeutics, signaling molecules, gene knockdowns, and gene over-expressions. When compared to other molecular and cell-level representations of drugs, we find that CellVS-Net achieves SOTA on both virtual screening benchmarks. Overall, CellVS-Net demonstrates that cell-level virtual screening methods are a viable alternative to molecular screening, and associated benchmarks enable hill-climbing on relevant drug discovery tasks.

10
An generative-AI framework for target-Specific MicroRNAs towards RNAi-based drug design

Gu, J.; Li, Y.

2026-05-11 genomics 10.64898/2026.05.07.723585 medRxiv
Top 0.4%
1.1%

MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression by binding to the target messenger RNA (mRNA), whose versatility has inspired RNA-interference (RNAi)-based drug designs. However, off-target effects lead to unintended gene silencing and toxicity. Existing methods suffer from experimental data scarcity and fail to effectively integrate target specificity into designing de novo small interfering RNAs (siRNAs). To overcome the above challenges, we present SpeciMiR, a specificity-guided generative framework that synthesizes target-conditioned miRNAs. By training on a large experimental dataset containing 2.2M miRNA-mRNA pairs, SpeciMiR minimizes off-target effects with enhanced on-target potency. As a result, SpeciMiR-generated miRNAs bind more strongly to the target mRNAs than the observed miRNAs and much less so to off-target mRNAs. We tested SpeciMiR on mRNA targets for liver disease, for which 6 FDA-approved siRNA-based drugs were available. SpeciMiR recovers binding regions that correspond to FDA-approved siRNA drugs across 3 targets, and demonstrates greater structural specificity for on-target mRNAs than for off-target mRNAs. Together, SpeciMiR offers an AI solution to synthesize miRNA-inspired and target-specific siRNA sequences towards RNAi-based drug design.

11
SRSA-VAE: Self-Attention-Based Feature Learning for Single-Cell Multimodal Clustering

Das, R.; Dey, A.; Maulik, U.; Bandyopadhyay, S.

2026-05-11 bioinformatics 10.64898/2026.05.06.723212 medRxiv
Top 0.4%
1.1%

Clustering plays a critical role in the analysis of single-cell omics data for identifying cellular heterogeneity and uncovering biological mechanisms. However, the high dimensionality, sparsity, and multimodal nature of single-cell datasets such as single-cell RNA sequencing (scRNA-seq) and Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) pose significant challenges for effective feature learning and representation learning. Traditional dimensionality reduction methods often rely on linear transformations and fail to capture complex nonlinear relationships between gene and protein expression profiles. In this work, we propose SRSA-VAE, a scalable variational autoencoder framework that integrates a residual self-attention encoder for context-aware feature learning and multimodal representation learning. The proposed model dynamically contextualizes gene and protein representations through a self-attention mechanism, enabling the encoder to capture inter-cell relationships and emphasize biologically informative signals. A scalable residual connection further stabilizes training and preserves essential input information during latent representation learning. We evaluate SRSA-VAE on five large-scale publicly available single-cell datasets, including both scRNA-seq and CITE-seq data, and compare its performance with established deep generative models. Experimental results demonstrate that SRSA-VAE consistently outperforms existing methods in Adjusted Rand Index (ARI) across benchmark datasets, with particularly strong gains on complex immune cell populations. Ablation studies further confirm the importance of the self-attention mechanism and residual connection in enhancing model stability and clustering accuracy. The proposed model offers a generalizable, robust, and scalable solution for single-cell clustering tasks. Code repository: https://github.com/rangan2510/srsa-vae

12
Nanopore event detection in a simple and adaptive way

Wei, P.; Kansari, M.; Mierzejewski, M.; Ensslen, T.; Lin, C.-Y.; Kavetsky, K.; Jones, P. D.; Behrends, J. C.; Drndic, M.; Fyta, M.

2026-05-11 bioinformatics 10.64898/2026.05.07.723187 medRxiv
Top 0.5%
0.9%

Nanopore read-out, that is, the current signal measured across nanometer-sized openings in dielectric membranes or through natural protein channels, enables the detection, identification and sequencing of individual molecules. Detection proceeds by analyzing the events of single biomolecules interacting with the pore. Accurate detection of these single events is key to identifying the physicochemical properties of analyte molecules. To this end, we further develop a very simple, fast, almost parameter-free, and adaptable cluster-based event detection (CBED) algorithm that clusters the nanopore signals prior to detecting nanopore events. The algorithm is validated against two other event detection schemes with respect to simplicity and efficiency. For this, nanopore data from four different experiments stemming from different laboratories that vary in the nanopore type, size, and analyte are considered. The comparison is made on the basis of the number of events detected, their quality, and the most important features extracted from nanopore events. Our results underline the higher efficiency and lower noise of the CBED-detected events for biological nanopore data and the need for on-the-fly adaptivity of the baseline current for a class of solid-state nanopore data.

13
HiCP2GAN: A Plug and Play Foundation Model-based GAN for Hi-C Enhancement

Olowofila, S.; Oluwadare, O.

2026-05-20 bioinformatics 10.64898/2026.05.18.725960 medRxiv
Top 0.6%
0.8%

The three-dimensional organization of chromatin shapes gene regulation and cellular function. Hi-C has emerged as the primary technique for mapping chromatin interactions genome-wide, yet high-resolution data remain costly and scarce, leaving many studies with sparse contact maps that limit downstream analysis. Deep learning methods, especially generative adversarial networks (GANs), have shown promise for enhancing low-resolution Hi-C data. Most existing GAN-based approaches, however, rely on custom discriminators trained from scratch, which can yield unstable training and limited generalization. Hi-C foundation models pretrained on large-scale data capture rich, transferable representations of chromatin structure; their use as discriminators within adversarial enhancement frameworks has not been explored. In this work, we introduce HiCP2GAN, a plug-and-play GAN that employs a pretrained Vision Transformer-based Hi-C foundation model as its discriminator. The discriminator was pretrained on 118 million Hi-C patches across diverse species and cell types, providing biologically meaningful gradients for adversarial supervision. The HiCP2GAN framework is generator-agnostic: any compatible Hi-C resolution enhancement architecture can serve as the generator, enabling fair comparison across methods. The encoder phase of the foundation model was adapted as a discriminator backbone, and we experimented with finetuning different numbers of layers from the input while freezing the deeper transformer layers. Finetuning the first few layers while freezing the rest preserved pretrained knowledge while allowing task-specific adaptation. Experiments on human cell lines show that HiCP2GAN consistently improves resolution over standalone generators and conventional GAN-based models, while serving as a plug-and-play framework for most non-GAN generator models. HiCP2GAN is publicly available at https://github.com/OluwadareLab/HiCP2GAN.

14
Dual-Stream Compression of High Bit-Depth Medical Images with Application to DNA Storage

Su, H.; Fan, W.; Peng, J.; Zhang, Y.

2026-05-20 bioinformatics 10.64898/2026.05.17.724501 medRxiv
Top 0.6%
0.8%

High bit-depth medical images preserve subtle intensity variations that are important for quantitative analysis and clinical interpretation, but their large dynamic range poses challenges for efficient compression. We propose a bit-plane-aware dual-stream compression framework for 16-bit medical images by separately modeling the most significant bit (MSB) and least significant bit (LSB) components. The MSB structural stream is encoded using JPEG coding with a Duplicate Segment Skipping (DSS) strategy to exploit spatial and segment-level redundancy, while the LSB detail stream is compressed using learned image compression to represent residual variations and fine-grained details. Experiments on four MRI and CT datasets show that the proposed method consistently outperforms representative traditional and learning-based codecs, achieving the lowest bit rate across all datasets. Meanwhile, it preserves high reconstruction fidelity. As a downstream application, we further demonstrate that the compressed bitstreams can be effectively integrated with DNA encoding and converted into sequences with favorable biochemical properties.
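The MSB/LSB separation described in this abstract can be sketched in a few lines. The abstract does not state where the split falls, so the even 8-bit/8-bit split below is an assumption, and the helper names `split_bit_planes` and `merge_bit_planes` are hypothetical rather than taken from the paper. The sketch only shows that the two streams together reproduce the original 16-bit values; the JPEG, Duplicate Segment Skipping, and learned-compression stages are not modeled.

```python
def split_bit_planes(pixels):
    """Split 16-bit pixel values into an MSB stream (high byte) and an LSB stream (low byte).

    Assumes an even 8/8 split; the paper's actual split point is not given in the abstract.
    """
    msb = [(p >> 8) & 0xFF for p in pixels]
    lsb = [p & 0xFF for p in pixels]
    return msb, lsb


def merge_bit_planes(msb, lsb):
    """Recombine the two byte streams into the original 16-bit values."""
    return [(hi << 8) | lo for hi, lo in zip(msb, lsb)]


# Toy 16-bit intensities, including the extremes of the dynamic range.
pixels = [0, 255, 256, 0x1234, 65535]
msb, lsb = split_bit_planes(pixels)
restored = merge_bit_planes(msb, lsb)
```

The high-byte stream changes slowly across smooth anatomy, which is consistent with treating it as the structural stream, while the low byte carries the fine-grained residual detail.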

15
AI-Discovered Cognitive Models Reveal Novel Insights into Human and Animal Learning

Kasenberg, D.; Castro, P. S.; Eckstein, M. K.; Elteto, N.; Dabney, W.; Wang, C. L.; Engelcke, M.; Mohanta, R.; Dev, A.; Botvinick, M. M.; Tomasev, N.; Turner, G. C.; Costa, V. D.; Daw, N. D.; Stachenfeld, K. L.; Miller, K. J.

2026-05-21 animal behavior and cognition 10.64898/2026.05.18.725921 medRxiv
Top 0.6%
0.8%

Scientific models are widely used across the natural sciences as an interface between scientific theories and empirical data [1]. Such models play a key role, for example, in the study of human and animal learning, where they express algorithmic hypotheses and relate them to psychology and neuroscience data [2, 3]. These models are traditionally handcrafted by expert researchers based on existing theory or new insights. Such handcrafted models, however, are now known to fall short of capturing the full richness of behavior, even in their narrow domains [4-7]. An alternative data-driven approach has emerged, seeking to discover new insights by fitting and interpreting flexible models [8-11]. However, these tools require substantial human effort to derive insight from data, and it has been unclear how to discover new ideas from data efficiently. Here, we present DataDIVER, a general approach for automatically discovering computational models from data, and demonstrate that these models surface novel mechanistic insights into human and animal learning. Our approach delivers models that take the form of short computer programs, which are optimized both to fit data well and to be simple. These programs explicitly connect with existing theoretical frameworks and are readily understandable by human scientists. They can also be used to make novel predictions, some of which we show are borne out in re-analysis of existing data. General-purpose tools for surfacing new ideas from data, especially in combination with the large datasets that are increasingly available in many fields, stand to dramatically accelerate scientific discovery.

16
MeiCOfi: Meiotic CrossOver Finder in haploid, diploid, polyploid and hyper-recombinant genomes

Fuentes, R. R.; Fernandes, J. B.; Susanto, T.; Wang, Y.; Underwood, C. J.

2026-05-04 bioinformatics 10.64898/2026.04.29.721680 medRxiv
Top 0.7%
0.7%

During the meiotic cell division, homologous chromosomes pair and recombine, leading to large reciprocal exchanges of genetic information. In most species, meiotic crossovers (COs) are crucial for normal chromosome segregation and they generate genetic diversity, which can be acted upon by natural selection in wild populations or by breeders to combine desirable traits in a genome. Identifying the position and frequency of COs is therefore essential in both classical genetics studies and breeding programmes. However, a computational tool capable of accurately detecting COs across diverse contexts, including varying marker densities, genome size and structure, recombination rate, and ploidy, remains lacking. We developed MeiCOfi (Meiotic CrossOver Finder) to detect meiotic crossover events at high-resolution from low-coverage genome sequencing data. We evaluated it using data from Arabidopsis thaliana, rice, barley and both intra- and inter-specific tomato hybrids, encompassing a wide range of genome complexities and marker densities. It reliably detects crossovers in hyper-recombinant A. thaliana with up to 62 CO per backcross offspring and in haploid gametes from barley with sequencing coverage as low as 0.1x. It can identify crossovers in polyploid genomes, including simulated recombinant tetraploids and also real data from tetraploid tomato hybrid offspring. Our results demonstrate that MeiCOfi can robustly identify crossovers in diverse genomic contexts.

17
Benchmarking Static Gene Regulatory Network Reconstruction and Dynamic Transition Probing in Single-Cell Foundation Models.

Ye, Z.; Yang, N.; Yang, X.; Mao, X.; Tang, C.

2026-05-20 systems biology 10.64898/2026.05.17.725083 medRxiv
Top 0.7%
0.7%

Single-cell foundation models may encode gene regulatory information, but it remains unclear which model components capture this signal and how it compares with conventional inference methods. Here, we introduce a unified benchmark that evaluates gene regulatory network (GRN) reconstruction from six single-cell foundation models and three classical baselines across six datasets and four reference network types. We disentangle three sources of regulatory signal within each model: pretrained token embeddings, final-layer hidden states, and attention-derived scores. Under a strict zero-shot setting, scGPT token-embedding similarity outperforms classical baselines on STRING and ChIP-seq references, recovers core transcription factors, and best preserves reference network topology. Moreover, because static GRNs cannot test whether learned gene-gene relationships are predictive of expression dynamics, we introduce dynamic transition probing, which iteratively applies a model's reconstruction head to drive early-cell profiles toward late-cell states without temporal supervision. We find that pretrained models capture meaningful developmental transitions, with scFoundation showing the strongest overall performance. Together, our results show that single-cell foundation models encode transferable regulatory and dynamical priors, but how well these priors can be recovered depends on model architecture, pretraining design, and extraction strategy.

18
Corpus-wide causality: Algorithm design & application for aggregating gene-disease causal evidence

Bansal, N.; Parsodkar, A. P.; Pathak, A.; Narayanan, M.

2026-05-12 bioinformatics 10.64898/2026.05.08.723796 medRxiv
Top 0.7%
0.7%

Identifying causal relationships and distinguishing them from associations is a central scientific endeavor with many applications; knowing causal links between genes and diseases, for instance, can focus drug discovery on curing diseases beyond just symptom management. Despite several studies on automatically extracting relations between entities from large biomedical literature corpora like PubMed, only a few studies extract causal relations from abstracts and even fewer summarize corpus-level evidence for causal links. Recently, Large Language Models (LLMs) have been increasingly deployed to summarize biomedical information and extract relations; however, there is a distinct lack of explicit benchmarking comparing these generalized LLM-based methods against specialized, domain-aware frameworks for corpus-wide causal inference. In this work, we develop a method to infer Corpus-Wide Causal Score (CWCS) of a gene-disease (G-D) pair by integrating two pieces of evidence: (i) network-based causal signals in a prior gene regulatory network, quantified as a CWCS-Net score using an existing multilayer network centrality algorithm; and (ii) corpus-wide literature evidence, quantified as a CWCS-TD (TD for Truth Discovery) score using a newly-developed TD algorithm. Our CWCS-TD (scoring) algorithm jointly and iteratively estimates causal scores for multiple G-D pairs while modeling the reliability of PubMed abstracts co-mentioning them; and represents an advance in the field of TD algorithms due to its incorporation of bibliometric features of publications to address the challenge of sparsity of abstracts that assert a G-D causal relation. 
Using OMIM as an external expert-curated reference to evaluate classifications of G-D pairs as causal or not, our CWCS method achieved a causal class F1 score of 0.600 across ten diseases, outperforming both LLMs evaluated, GPT-4o and MMed-Llama 3 (this performance trend persists when the area under the precision-recall curve is used as the evaluation metric). Both LLMs exhibit high recall accompanied by comparatively low precision, resulting in lower causal class F1 scores (0.505 for GPT-4o and 0.522 for MMed-Llama 3) due to a large number of false positive predictions. Taken together, these evaluations and other ablation studies show the promise of our carefully designed algorithm in collating and integrating evidence of biomedical causal relations from both network- and literature-based sources, thereby supporting its broader applicability.

19
HiCPEP: Efficient estimation of chromatin compartment PC1 from Hi-C covariance structure

Cheng, Z.-R.; Chang, J.-M.

2026-05-18 bioinformatics 10.64898/2026.05.14.725269 medRxiv
Top 0.8%
0.7%

Principal component analysis (PCA) of the Hi-C Pearson correlation matrix is the standard approach for identifying A/B chromatin compartments. Despite its widespread use, the relationship between the first principal component (PC1) and the underlying compartment structure remains insufficiently characterized, and computing PC1 can become computationally expensive for high-resolution Hi-C data. Here we investigate the role of the PC1 explained variance ratio in compartment analysis and show that chromosomes with strong compartment organization typically exhibit a dominant PC1 signal. Based on this observation, we propose HiCPEP, a heuristic algorithm that estimates the sign pattern and relative magnitude of PC1 directly from the Hi-C Pearson covariance matrix without performing explicit eigenvector decomposition. The method can operate from either a dense Pearson matrix for fast approximation or a sparse observed/expected (O/E) matrix to reduce memory usage. Furthermore, because many covariance columns exhibit PC1-like patterns when the compartment signal is strong, HiCPEP can be accelerated using random sampling without substantially reducing accuracy. Across multiple Hi-C datasets, HiCPEP consistently recovered compartment patterns with high similarity to reference PC1 vectors produced by standard PCA-based methods. Benchmark experiments show that HiCPEP achieves comparable accuracy while reducing computational cost in terms of runtime or memory usage. These results suggest that HiCPEP provides a practical alternative for efficient chromatin compartment analysis from large-scale Hi-C datasets. The HiCPEP implementation is freely available at https://github.com/ZhiRongDev/HiCPEP.
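
The observation that covariance columns look like PC1 when one component dominates can be illustrated with a small sketch. This is not the HiCPEP algorithm: the synthetic matrix, the helper names (`make_toy_cov`, `pc1_from_column`, `pc1_exact`), and the rule of picking the column with the largest norm are all assumptions made for illustration only.

```python
import numpy as np

def make_toy_cov(n=200, strength=10.0, noise=0.1, seed=0):
    """Build a synthetic symmetric matrix with one dominant component.

    Toy stand-in for a Hi-C covariance matrix (assumption): a strong
    rank-1 term v v^T plus weak symmetric noise.
    """
    rng = np.random.default_rng(seed)
    # Hidden "compartment" vector: random signs, magnitudes in [0.5, 1.5].
    v = rng.choice([-1.0, 1.0], size=n) * rng.uniform(0.5, 1.5, size=n)
    e = rng.normal(scale=noise, size=(n, n))
    return strength * np.outer(v, v) + (e + e.T) / 2

def pc1_from_column(cov):
    """Estimate PC1 (up to sign) from a single column, with no eigendecomposition.

    The largest-norm selection rule is a hypothetical choice, not HiCPEP's.
    """
    norms = np.linalg.norm(cov, axis=0)
    col = cov[:, np.argmax(norms)]
    return col / np.linalg.norm(col)

def pc1_exact(cov):
    """Reference PC1: eigenvector of the largest eigenvalue."""
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    return eigvecs[:, -1]

cov = make_toy_cov()
approx = pc1_from_column(cov)
exact = pc1_exact(cov)

# Absolute cosine similarity, because the sign of an eigenvector is arbitrary.
similarity = abs(float(approx @ exact))
print(f"|cos(approx, exact)| = {similarity:.4f}")
```

On this toy matrix the single column is nearly parallel to the exact eigenvector because the rank-1 term dominates; with a weak first component the shortcut would be expected to degrade, which is consistent with the abstract's emphasis on the PC1 explained variance ratio.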

20
Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support

Khan, U. A.

2026-05-11 bioinformatics 10.64898/2026.05.05.723102 medRxiv
Top 0.8%
0.6%

Retrieval-augmented generation (RAG) is increasingly applied to clinical decision support in oncology, where treatment selection depends on identifying a patient's specific somatic variant from an NGS report and matching it to evidence-graded therapy options. The vector retrieval that underlies most RAG systems uses cosine similarity over text embeddings, an architecture optimized for linguistic proximity rather than entity-level identity. We hypothesize that cosine-similarity-based retrieval conflates clinically distinct cancer variants at clinically relevant rates, while a typed-graph approach in which each variant is a discrete node preserves variant-level identity by construction. We evaluated 9 cancer variant pairs known to have differential FDA-approved therapy indications, with variant identity informed by the CIViC clinical variant evidence database and primary clinical literature. Variant pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. We computed pairwise cosine similarity for each variant's text representation across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) and three text formats (short, medium, long). Across the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs (9/9) had cosine similarity ≥ 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) showed lower conflation in the medium format (11%) but rose to 100% with added clinical context. At the more stringent τ = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%.
The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. The typed-graph baseline achieves zero conflation by construction. We discuss the architectural implications: vector retrieval is appropriate for unstructured literature search but introduces unsafe ambiguity when used as the substrate for variant-level reasoning that drives drug-selection decisions. We argue that typed-graph retrieval should be the default architecture for any retrieval-grounded clinical decision support system that recommends targeted therapy.
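
The contrast between threshold-based conflation and discrete-node lookup can be sketched as follows. This is a minimal illustration, not the study's pipeline: the three-dimensional vectors are made-up toy values rather than real PubMedBERT, MedCPT, or BGE embeddings, and the helper names (`cosine_similarity`, `conflated_pairs`, `lookup`) and node fields are hypothetical. Only the G12C versus G12D therapy distinction is taken from the abstract.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def conflated_pairs(embeddings, pairs, tau=0.95):
    """Return the pairs whose similarity meets or exceeds the threshold tau."""
    return [
        (x, y)
        for x, y in pairs
        if cosine_similarity(embeddings[x], embeddings[y]) >= tau
    ]

# Toy vectors (assumption): nearly identical because the text differs by one letter.
embeddings = {
    "KRAS G12C": [0.90, 0.40, 0.10],
    "KRAS G12D": [0.90, 0.40, 0.12],
}
pairs = [("KRAS G12C", "KRAS G12D")]
hits = conflated_pairs(embeddings, pairs, tau=0.95)
print("conflated under cosine retrieval:", hits)

# Typed-graph alternative: each variant is its own node, retrieved by exact key,
# so two distinct variants can never be returned in place of one another.
variant_nodes = {
    "KRAS G12C": {"gene": "KRAS", "change": "G12C", "fda_targeted_therapy": True},
    "KRAS G12D": {"gene": "KRAS", "change": "G12D", "fda_targeted_therapy": False},
}

def lookup(variant_id):
    """Exact-identity lookup; raises KeyError for an unknown variant."""
    return variant_nodes[variant_id]

print(lookup("KRAS G12C"))
print(lookup("KRAS G12D"))
```

In the toy case the pair is flagged as conflated under the similarity threshold, while the keyed lookup keeps the two variants and their differing therapy status apart.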